Web Scraping in Python

In this appendix lecture we'll go over how to scrape information from the web using Python.

We'll go to a website, decide what information we want, see where and how it is stored, then scrape it and store it in a pandas DataFrame!

Some things you should consider before web scraping a website:

1.) You should check a site's terms and conditions before you scrape it.

2.) Space out your requests so you don't overload the site's server; hammering a site can get you blocked (see the short sketch after this list).

3.) Scrapers break over time - web pages change their layout frequently, so you'll more than likely have to rewrite your code.

4.) Web pages are usually inconsistent; more than likely you'll have to clean up the data after scraping it.

5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.
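
For point 2, here is a minimal sketch of spacing out requests (the URLs, delay value, and User-Agent string are just illustrative assumptions):

import time
import requests

# Hypothetical list of pages to scrape
urls_to_scrape = ['http://example.com/page1', 'http://example.com/page2']

for url in urls_to_scrape:
    # Identify yourself with a User-Agent header and pause between requests
    result = requests.get(url, headers={'User-Agent': 'my-learning-scraper'})
    # ... parse result.content here ...
    time.sleep(5)  # wait a few seconds so we don't overload the server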

To learn more about HTML I suggest these two resources:

W3School

Codecademy

The three modules we'll need in addition to Python are:

1.) BeautifulSoup, which you can download by typing: pip install beautifulsoup4 or conda install beautifulsoup4 (for the Anaconda distribution of Python) in your command prompt.

2.) lxml, which you can download by typing: pip install lxml or conda install lxml (for the Anaconda distribution of Python) in your command prompt.

3.) requests, which you can download by typing: pip install requests or conda install requests (for the Anaconda distribution of Python) in your command prompt.
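
Once everything is installed, a quick sanity check that the modules import and report their versions (just a sketch; the version numbers will vary on your machine):

import bs4
import requests
from lxml import etree

print bs4.__version__
print requests.__version__
print etree.__version__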

We'll start with our imports:


In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
import pandas as pd
from pandas import Series,DataFrame

For our quick web scraping tutorial, we'll look at some legislative reports from the University of California Web Page. Feel free to experiment with other webpages, but remember to be cautious and respectful in what you scrape and how often you do it. Always check the legality of a web scraping job.

Let's go ahead and set the url.


In [3]:
url = 'http://www.ucop.edu/operating-budget/budgets-and-reports/legislative-reports/2013-14-legislative-session.html'

Now let's go ahead and use requests to grab content from the url, and set it as a Beautiful Soup object.


In [5]:
# Request content from web page
result = requests.get(url)
c = result.content

# Set as Beautiful Soup object (using the lxml parser we installed)
soup = BeautifulSoup(c, 'lxml')

Now we'll use Beautiful Soup to search for the table we want to grab!


In [6]:
# Go to the section of interest
summary = soup.find("div",{'class':'list-land','id':'content'})

# Find the tables in the HTML
tables = summary.find_all('table')

Now we need to use Beautiful Soup to find the table entries. A 'td' tag defines a standard cell in an HTML table, and the 'tr' tag defines a row in an HTML table.

We'll parse through our tables object and try to find each cell using the findAll('td') method.

There are tons of options to use with findAll in Beautiful Soup. You can read about them in the Beautiful Soup documentation.
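
For example, findAll (also spelled find_all in newer versions of Beautiful Soup) accepts attribute filters and a limit argument. A couple of illustrative calls, assuming the soup and tables objects from above (the selectors themselves are just hypothetical examples):

# Only the first two rows of the first table
first_two_rows = tables[0].findAll('tr', limit=2)

# Every tag that has an href attribute
links = soup.findAll('a', href=True)

# Filter by an attribute dictionary
divs = soup.findAll('div', {'class': 'list-land'})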


In [7]:
# Set up empty data list
data = []

# Grab all the rows from the first table
rows = tables[0].findAll('tr')

# now grab every HTML cell in every row
for tr in rows:
    cols = tr.findAll('td')
    # Grab the text from each cell in the row
    for td in cols:
        text = td.find(text=True)
        print text,
        data.append(text)


1 08/01/13 2013-14 (EDU 92495) Proposed Capital Outlay Projects (2013-14 only) (pdf) 2 09/01/13 2014-15  (EDU 92495) Proposed Capital Outlay Projects (pdf) 3 11/01/13 Utilization of Classroom and Teaching Laboratories (pdf) 4 11/01/13 Instruction and Research Space Summary & Analysis (pdf) 5 11/15/13 Statewide Energy Partnership Program (pdf) 6 11/30/13 2013-23 Capital Financial Plan (pdf) 7 11/30/13 Projects Savings Funded from Capital Outlay Bond Funds (pdf) 8 12/01/13 Streamlined Capital Projects Funded from Capital (pdf) 9 01/01/14 Annual General Obligation Bonds Accountability (pdf) 10 01/01/14 Small Business Utilization (pdf) 11 01/01/14 Institutional Financial Aid Programs - Preliminary report (pdf) 12 01/10/14 Summer Enrollment (pdf) 13 01/15/14 Contracting Out for Services at Newly Developed Facilities (pdf) 14 03/01/14 Performance Measures (pdf) 15 03/01/14 Entry Level Writing Requirement (pdf) 16 03/31/14 Annual Report on Student Financial Support (pdf) 17 04/01/14 Unique Statewide Pupil Identifier (pdf) 18 04/01/14 Riverside School of Medicine (pdf) 19 04/01/14 SAPEP Funds and Outcomes - N/A 20 05/15/14 Receipt and Use of Lottery Funds (pdf) 21 07/01/14 Cogeneration and Energy Consv Major Capital Projects (pdf) 


 Future Reports 
24 12- Breast Cancer Research Fund 25 12-31-15 Cigarette and Tobacco Products Surtax Research Program 26 01-01-16 Best Value Program 27 01-01-16 California Subject Matter Programs 28 04-01-16 COSMOS Program Outcomes

Let's see what the data list looks like.


In [8]:
data


Out[8]:
[u'1',
 u'08/01/13',
 u'2013-14 (EDU 92495) Proposed Capital Outlay Projects (2013-14 only) (pdf)',
 u'2',
 u'09/01/13',
 u'2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)',
 u'3',
 u'11/01/13',
 u'Utilization of Classroom and Teaching Laboratories (pdf)',
 u'4',
 u'11/01/13',
 u'Instruction and Research Space Summary & Analysis (pdf)',
 u'5',
 u'11/15/13',
 u'Statewide Energy Partnership Program (pdf)',
 u'6',
 u'11/30/13',
 u'2013-23 Capital Financial Plan (pdf)',
 u'7',
 u'11/30/13',
 u'Projects Savings Funded from Capital Outlay Bond Funds (pdf)',
 u'8',
 u'12/01/13',
 u'Streamlined Capital Projects Funded from Capital (pdf)',
 u'9',
 u'01/01/14',
 u'Annual General Obligation Bonds Accountability (pdf)',
 u'10',
 u'01/01/14',
 u'Small Business Utilization (pdf)',
 u'11',
 u'01/01/14',
 u'Institutional Financial Aid Programs - Preliminary report (pdf)',
 u'12',
 u'01/10/14',
 u'Summer Enrollment (pdf)',
 u'13',
 u'01/15/14',
 u'Contracting Out for Services at Newly Developed Facilities (pdf)',
 u'14',
 u'03/01/14',
 u'Performance Measures (pdf)',
 u'15',
 u'03/01/14',
 u'Entry Level Writing Requirement (pdf)',
 u'16',
 u'03/31/14',
 u'Annual Report on Student\xa0Financial Support (pdf)',
 u'17',
 u'04/01/14',
 u'Unique Statewide Pupil Identifier (pdf)',
 u'18',
 u'04/01/14',
 u'Riverside School of Medicine (pdf)',
 u'19',
 u'04/01/14',
 u'SAPEP Funds and Outcomes - N/A',
 u'20',
 u'05/15/14',
 u'Receipt and Use of Lottery Funds (pdf)',
 u'21',
 u'07/01/14',
 u'Cogeneration and Energy Consv Major Capital Projects (pdf)',
 u'\n',
 u'\n',
 u'\n',
 u'\xa0',
 u'Future Reports',
 u'\n',
 u'24',
 u'12-',
 u'Breast Cancer Research Fund',
 u'25',
 u'12-31-15',
 u'Cigarette and Tobacco Products Surtax Research Program',
 u'26',
 u'01-01-16',
 u'Best Value Program',
 u'27',
 u'01-01-16',
 u'California Subject Matter Programs',
 u'28',
 u'04-01-16',
 u'COSMOS Program Outcomes']

Now we'll use a for loop to go through the list and grab only the cells with a pdf file in them. We'll also need to keep track of the index so we can line up the date of each report.


In [9]:
# Set up empty lists
reports = []
date = []

# Set index counter
index = 0

# Go find the pdf cells
for item in data:
    if 'pdf' in item:
        # The date sits in the cell just before the report title
        date.append(data[index-1])
        
        # Get rid of \xa0
        reports.append(item.replace(u'\xa0', u' '))
                    
    index += 1

You'll notice a line to take care of '\xa0'. This is a non-breaking space character that shows up in some of the scraped cells and causes unicode errors if you don't handle it. Web pages can be messy and inconsistent, and it is very likely you'll have to do some research to take care of problems like these.

Here's the link I used to solve this particular issue: StackOverflow Page
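
A more general way to handle stray characters like the non-breaking space is to normalize the unicode text. A minimal sketch using Python's standard unicodedata module (the sample string is just one of the cells scraped above):

import unicodedata

# NFKC normalization folds characters like u'\xa0' into a plain space
cleaned = unicodedata.normalize('NFKC', u'2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)')
print cleaned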

Now all that is left is to organize our data into a pandas DataFrame!


In [10]:
# Set up Dates and Reports as Series
date = Series(date)
reports = Series(reports)

In [11]:
# Concatenate into a DataFrame
legislative_df = pd.concat([date,reports],axis=1)

In [12]:
# Set up the columns
legislative_df.columns = ['Date','Reports']

In [13]:
# Show the finished DataFrame
legislative_df


Out[13]:
Date Reports
0 08/01/13 2013-14 (EDU 92495) Proposed Capital Outlay Pr...
1 09/01/13 2014-15 (EDU 92495) Proposed Capital Outlay P...
2 11/01/13 Utilization of Classroom and Teaching Laborato...
3 11/01/13 Instruction and Research Space Summary & Analy...
4 11/15/13 Statewide Energy Partnership Program (pdf)
5 11/30/13 2013-23 Capital Financial Plan (pdf)
6 11/30/13 Projects Savings Funded from Capital Outlay Bo...
7 12/01/13 Streamlined Capital Projects Funded from Capit...
8 01/01/14 Annual General Obligation Bonds Accountability...
9 01/01/14 Small Business Utilization (pdf)
10 01/01/14 Institutional Financial Aid Programs - Prelimi...
11 01/10/14 Summer Enrollment (pdf)
12 01/15/14 Contracting Out for Services at Newly Develope...
13 03/01/14 Performance Measures (pdf)
14 03/01/14 Entry Level Writing Requirement (pdf)
15 03/31/14 Annual Report on Student Financial Support (pdf)
16 04/01/14 Unique Statewide Pupil Identifier (pdf)
17 04/01/14 Riverside School of Medicine (pdf)
18 05/15/14 Receipt and Use of Lottery Funds (pdf)
19 07/01/14 Cogeneration and Energy Consv Major Capital Pr...

There are other less intense options for web scraping:

Check out these two companies:

https://import.io/

https://www.kimonolabs.com/

Aside


In [1]:
# http://docs.python-guide.org/en/latest/scenarios/scrape/

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

# inspect element
# <div title="buyer-name">Carson Busses</div>
# <span class="item-price">$29.95</span>

#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print 'Buyers: ', buyers
print 'Prices: ', prices


Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']

In [ ]:
# https://www.flightradar24.com/56.16,-52.58/7
# http://stackoverflow.com/questions/39489168/how-to-scrape-real-time-streaming-data-with-python

# If you look at the network tab in the developer console in Chrome (for example), you'll see the requests to https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1


import requests
from bs4 import BeautifulSoup
import time

def get_count():
    url = "https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=57.78,54.11,-56.40,-48.75&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1"
    
    # Request with a fake header, otherwise you will get a 403 HTTP error
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    
    # Parse the JSON
    data = r.json()
    counter = 0
    
    # Iterate over the elements to get the number of total flights
    for element in data["stats"]["total"]:
        counter += data["stats"]["total"][element]
    
    return counter

while True:
    print(get_count())
    time.sleep(8)


Good Job!